Summary

As microprocessors increase in power, the economics of
centralized computing have changed dramatically. At the
beginning of the 1980s, mainframes and supercomputers 
were often considered to be cost-effective machines for 
scalar computing. Today, microprocessor-based RISC 
(reduced-instruction-set computer) systems have displaced 
many uses of mainframes and supercomputers. Supercom- 
puters are still cost competitive when processing jobs that 
require both large memory size and high memory band- 
width. One such application is array processing. Certain 
numerical operations are appropriate to use in a Remote 
Procedure Call (RPC)-based environment. Matrix multipli- 
cation is an example of an operation that can have a suffi- 
cient number of arithmetic operations to amortize the cost 
of an RPC call. This paper describes an experiment which 
demonstrates that matrix multiplication can be executed 
remotely on a large system to speed the execution over that 
experienced on a workstation. 

Introduction 

The title of this paper was chosen deliberately to be provoc- 
ative. To many people familiar with traditional supercom- 
puting, it sounds ludicrous. To others who are more 
familiar with distributed processing, it is a natural exten- 
sion of current techniques to permit easier access to, and 
more efficient use of, unfamiliar and potentially hard-to- 
program supercomputers. This paper is a description of a 
reasonably successful attempt to split off the numerically 
intensive array-oriented operations from the largely scalar 
parts of application code. 

The economic viability of current supercomputing practice 
is under assault by “killer micros.” Microprocessor-based 
RISC (reduced-instruction-set computer) systems are now 
within a factor of four in speed on scalar code relative to far 
more expensive supercomputers. In some cases, the scalar 
speed of the supercomputer is roughly equal to that of the 
microprocessor-based system. Many programs are only 
partially vectorizable. A supercomputer user with a par- 
tially vectorizable, numerically intensive code is faced with 
a dilemma. The user can run the program locally on a 
workstation or server and wait a long time for an answer, or 
the user can run it on a supercomputer, such as the Cray 
Y-MP, and get the answer back faster but “waste” many far 
more valuable Cray CPU cycles. Supercomputers are 
designed to overcome the memory-bandwidth bottleneck 
that prevents large array-oriented programs from running 
fast on less expensive machines. But scalar code does not 
make use of that expensive memory bandwidth during 
most of its execution time. This situation motivated an 
experiment to develop an application in which the scalar 
and vector parts could be conveniently split apart. 


The ideal for a distributed system is to run the scalar parts 
of a program on a low-cost workstation or server and to run 
the vector, or parallel, part on a suitable minisupercom- 
puter, supercomputer, or massively parallel computer. Such 
computers are optimized for high speed on large-memory, 
numerically intensive tasks. With the current state of the 
art, it is very costly and time consuming to do this for indi- 
vidual applications. The cost of developing the code to split 
an existing application into scalar and vector parts is still 
very high. For this experiment, a different approach was 
chosen. A standard subroutine library, Level 3 Basic Lin- 
ear Algebra Subprograms (BLAS3) (ref. 1), was chosen as 
the user interface. This is the natural choice, since many 
users already have calls to BLAS3 routines, in particular 
SGEMM, for which optimized versions exist on Cray, Con- 
vex, and other vector machines. 

This paper describes a distributed application and the 
results of running it on various processors; the paper will 
show that it is feasible to split off large-array operations to 
another machine and perform the operations via RPC 
(Remote Procedure Call). Then, it will describe the limita- 
tions which apply to the current Cray system that restrict 
use of the technique to large problems. 

The authors gratefully acknowledge the many constructive 
comments made by the reviewers. 

1. Description of the Experiment 

1.1 Motivation 

The purpose of this experiment was to demonstrate the fea- 
sibility of developing a distributed application in which the 
compute-intensive part of the code could be conveniently 
separated from the scalar, or less compute-intensive, com- 
putations. The distributed application could then use net- 
work services to allow the partitioning of different parts of 
the application to separate and diverse computing 
resources that are best suited for the particular task. 

This experiment demonstrates a distributed application that
provides a user-level, FORTRAN-type subroutine call 
interface to the server process. The server was written to 
utilize a standard matrix multiplication subroutine, 
SGEMM. This subroutine was chosen because many users 
already have calls to SGEMM in their codes and because 
SGEMM is a standard routine for multiplying large arrays 
of data and is readily available in optimized form on many 
computer systems. 

With the distributed-application approach used here, it is 
not necessary for the user to program the remote machine 
or convert any code. Instead, a library is linked to an exist- 
ing program, and the computation takes place remotely 
with only limited involvement of the user. 



1.2 Description of Application Design 

The application created for this experiment is divided into 
two logical pieces, a client process and a server process. 
The client process contains the scalar code and initiates 
requests to the server process to perform some action (in 
this case, the compute-intensive process of matrix multipli- 
cation). The server process waits for a request to be made 
by the client and then performs the requested action. The 
client and server processes may run on the same computer 
system, or the server may run remotely on an independent 
system that is connected by a communication network to 
the client system. 

The remote-procedure call model is similar to the usual 
local-procedure call model in that a single thread of control 
logically winds through the client (caller) process and 
through the server process. The caller process sends a mes- 
sage to the server process and waits for a reply. The mes- 
sage contains the parameters to be passed to the remote 
procedure. The reply contains the procedure’s results. 
When the reply is received, the results are made available
to the caller and the caller’s execution is resumed. 

The server is normally waiting for the arrival of a message 
from a caller. When a message arrives, the server process 
extracts the parameters for the remote procedure, calls a 
dispatch routine, performs the requested service, sends 
back a reply to the caller, and then waits for the next 
message. 
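The server-side control flow described above can be sketched in a single process. This is only an illustration of the dispatch step, with no network or XDR involved; the names service_entry, dispatch_request, double_value, and negate_value are hypothetical and do not appear in the Sun RPC library or in the application described here.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical; this only mimics the control flow of
 * an RPC server in a single process, with no network or XDR involved. */
typedef int (*service_fn)(int arg);

struct service_entry {
    unsigned long proc;      /* procedure number carried in the request */
    service_fn    service;   /* routine that performs the requested action */
};

static int double_value(int arg) { return 2 * arg; }
static int negate_value(int arg) { return -arg; }

static const struct service_entry services[] = {
    { 1UL, double_value },
    { 2UL, negate_value }
};

/* Dispatch one request: find the procedure, run it, and store the reply.
 * Returns 0 on success, -1 if the procedure number is unknown. */
int dispatch_request(unsigned long proc, int arg, int *reply)
{
    size_t i;

    assert(reply != NULL);
    for (i = 0; i < sizeof services / sizeof services[0]; i++) {
        if (services[i].proc == proc) {
            *reply = services[i].service(arg);
            return 0;
        }
    }
    return -1;
}
```

In a real server the request would arrive as a decoded message and the reply would be encoded and sent back to the caller; here both are plain function arguments.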

To create the client/server interface code, networking ser- 
vices developed by Sun Microsystems were used (refs. 2 
and 3). The Sun Remote Procedure Call (RPC) facility is a 
library of procedures that implement the logical client-to-server
communications to support network applications.
Sun RPC was used because it is implemented on a variety 
of operating systems, including Cray Unicos, SGI Irix, 
SunOS, Convex OS, and DEC Ultrix (Ultrix was lacking
the RPC protocol compiler, rpcgen).

The details of programming applications to use RPC can be 
tedious. One of the more difficult areas is passing data in a 
portable format between different computer architectures. 
The External Data Representation (XDR) is a standard for
the description and encoding of data. XDR uses a language 
to describe data formats in a concise manner. XDR library 
routines may be used to encode data from the local machine 
representation into a standard machine-independent for- 
mat. The machine that receives the data uses XDR to 
decode the data from the standard representation to its own 
internal format. XDR relies heavily on the IEEE standard
for floating-point data representation and is implemented 
most efficiently on architectures that use the IEEE 
standard. 
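To make the encoding step concrete, the following is a minimal sketch of how one single-precision value can be packed into, and unpacked from, the XDR wire form, which is a 4-byte, big-endian IEEE single. It is not the Sun XDR library code: the names xdr_sketch_put_float and xdr_sketch_get_float are hypothetical, and the sketch assumes that the host float is already a 32-bit IEEE value. On a machine with a different native format, a genuine numerical conversion would be needed at this point.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of Sun RPC): pack one float into the
 * 4-byte big-endian form used by XDR. Assumes the host float is a
 * 32-bit IEEE 754 value. */
void xdr_sketch_put_float(unsigned char out[4], float value)
{
    uint32_t bits;

    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&bits, &value, sizeof bits);    /* copy the raw bit pattern */
    out[0] = (unsigned char)(bits >> 24);  /* most significant byte first */
    out[1] = (unsigned char)(bits >> 16);
    out[2] = (unsigned char)(bits >> 8);
    out[3] = (unsigned char)bits;
}

/* Hypothetical helper: rebuild the host float from the 4 wire bytes. */
float xdr_sketch_get_float(const unsigned char in[4])
{
    uint32_t bits = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16)
                  | ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
    float value;

    memcpy(&value, &bits, sizeof value);
    return value;
}
```

On an IEEE machine the work per element is a byte reordering at most, which is consistent with the remark that XDR is implemented most efficiently on such architectures.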


The use of RPC procedures and the writing of XDR rou- 
tines to convert procedure arguments and results into XDR 
network format, and vice versa, are facilitated by using 
rpcgen. rpcgen accepts a remote program interface definition
written in the RPC language. It produces a C-language
output for RPC programs. The output includes skeleton 
versions of the client and the server routines, XDR routines 
to handle both the parameters and results of the procedure, 
and a header file that contains common definitions. The client
and server skeletons contain calls to RPC library procedures
to handle the network communication. The developer
writes the server procedure and links it with the server skel- 
eton to produce an executable server program. The user 
(client) program that makes local procedure calls is linked 
with the client skeleton. 

1.3 Application Code

The protocol specification for our remote program is writ- 
ten in the RPC language and defines the names of the pro- 
cedures, the data types of their input parameters and output 
results, the name of the remote program, and its version 
name. Each procedure name, program name, and version 
name is assigned a number. A remote procedure is uniquely 
identified by its program number, version number, and pro- 
cedure number. 

The code labeled proto.x (listed in the Appendix) shows
the protocol specification for our remote program, which
contains one version and has six procedure definitions. The
program name is REMOTE_MATRIX_PROG. The program
number is selected arbitrarily from the range of numbers
defined by the RPC protocol for user-defined services. The
version name is REMOTE_MATRIX_VERS (with version
no. 1). Six procedure names and numbers are specified with
definitions of their input parameters and output results. For
example, “int R_SMATRIXALLOC(int) = 1;”
specifies a procedure named R_SMATRIXALLOC with an
integer input parameter and an integer output result (the
procedure number is 1). “singlep_matrix R_SGEMM(sgemm_args) = 5;”
specifies a procedure named R_SGEMM (with procedure no. 5).
The procedure’s input parameters are defined by the
previously declared structure sgemm_args, which contains
the parameters necessary for calling the matrix
multiplication routine SGEMM. The remote-procedure output
results are defined by the structure singlep_matrix, which
declares an array of floats that will contain the results of
the matrix multiplication.

The server program is designed to utilize the BLAS3 routines
SGEMM, single-precision matrix multiplication, and
DGEMM, double-precision matrix multiplication. The server
program contains six remote procedures, two of which
allocate space for the single- or double-precision result
matrix, two of which release the allocated space, and one
each to call SGEMM and DGEMM. The code labeled
server.c (listed in the Appendix) shows the six
procedures that are linked with the server skeleton code
(generated by rpcgen) to produce the running server program.
server.c contains the procedures that do the
“real work” of the server program, such as allocating space
for the result matrix, calling the matrix multiplication
routine SGEMM or DGEMM, and freeing the allocated memory
after the result matrix has been returned to the client.
server.c contains the code that corresponds to the six
procedures defined in proto.x. The procedure-naming
convention is to use the remote-procedure name declared in
the prototype definition (e.g., R_SGEMM), convert it to
lower-case letters, append an underline (“_”), and add
the version number (here 1) to produce the name
r_sgemm_1. Note that remote procedures always use a
pointer to their arguments and a pointer to their results
since these data will be processed by the appropriate XDR
routines.
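The naming convention can be illustrated with a short sketch that derives the stub name from the declared procedure name and the version number. The helper rpc_stub_name is hypothetical and is not part of rpcgen or of the application; it only mirrors the rule stated above (lower-case the name, append an underline, append the version number).

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the stub name that the naming convention
 * yields, e.g. ("R_SGEMM", 1) -> "r_sgemm_1".
 * Returns 0 on success, -1 if the output buffer is too small. */
int rpc_stub_name(const char *proc, unsigned version, char *out, size_t outlen)
{
    size_t i, len;
    int written;

    assert(proc != NULL && out != NULL);
    len = strlen(proc);
    if (len >= outlen)
        return -1;
    for (i = 0; i < len; i++)
        out[i] = (char)tolower((unsigned char)proc[i]);
    /* append "_<version>" and check that it was not truncated */
    written = snprintf(out + len, outlen - len, "_%u", version);
    if (written < 0 || (size_t)written >= outlen - len)
        return -1;
    return 0;
}
```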

SGEMM and DGEMM perform the matrix-matrix operation
$C \leftarrow \alpha AB + \beta C$, where $\alpha$ and $\beta$ are scalars and A,
B, and C are matrices. They also perform the same basic
operation with either matrix A or B transposed.
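For readers unfamiliar with the BLAS3 interface, the operation can be written out as an unoptimized reference loop. This is a sketch only, under stated assumptions: the name gemm_reference is hypothetical, arrays are stored column-major as in FORTRAN, and neither matrix is transposed. It is not the vendor-supplied or author-modified SGEMM discussed in this paper.

```c
#include <assert.h>
#include <stddef.h>

/* Reference (unoptimized) C <- alpha*A*B + beta*C for column-major arrays,
 * no transposition. A is m x k (leading dimension lda), B is k x n (ldb),
 * C is m x n (ldc). Hypothetical name; not the BLAS3 routine itself. */
void gemm_reference(int m, int n, int k, float alpha,
                    const float *a, int lda,
                    const float *b, int ldb,
                    float beta, float *c, int ldc)
{
    int i, j, l;

    assert(m >= 0 && n >= 0 && k >= 0);
    assert(lda >= (m > 1 ? m : 1));
    assert(ldb >= (k > 1 ? k : 1));
    assert(ldc >= (m > 1 ? m : 1));

    for (j = 0; j < n; j++) {
        for (i = 0; i < m; i++) {
            float sum = 0.0f;
            for (l = 0; l < k; l++)
                sum += a[i + (size_t)l * lda] * b[l + (size_t)j * ldb];
            c[i + (size_t)j * ldc] = alpha * sum + beta * c[i + (size_t)j * ldc];
        }
    }
}
```

The triple loop makes the operation count visible: about 2mnk floating-point operations are performed on only mk + kn + mn input elements.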

The code labeled sgemm.c (listed in the Appendix) shows
one of the client-side procedures (there is another one for
using DGEMM) that is linked with the client skeleton code
(generated by rpcgen) to produce the user-callable interface
to the remote procedure server. sgemm.c defines the
standard user interface to SGEMM. The first part of the code
is concerned with checking the validity of the input
parameters. If the input parameters are valid, the procedure
uses an environment variable to obtain the name of the system
on which the remote server is running. Then, a data
structure known as a “client handle” is created to be passed
to the skeleton client routines that will call the remote
procedure. The skeleton routines are called the same way as
the remote procedures are defined in server.c except that
the client handle is inserted as a second argument. After
executing a call to a remote procedure, the result pointer is
checked to determine if there was an error resulting from a
failure of the RPC mechanism. If not, the results are
available for use by the caller.

1.4 Network and Operating System Environment 

The main network at Ames Research Center (ARC) is 
referred to as ARCLAN. ARCLAN is largely Ethernet 
based. Ethernet physical media are coaxial cable, optical 
fiber, unshielded twisted pair, and some microwave links. 
Link level routers, network level routers, and other hosts 
and IP gateways provide connectivity over the campus. 
Some systems are connected via Fiber Distributed Data 
Interface (FDDI), although most systems are connected 
directly to Ethernet. There are also additional networks
using UltraNet and Hyperchannel, although they are not
part of ARCLAN and were not used for this experiment. 

Although a faster medium than Ethernet would have 
improved effectiveness, Ethernet was fast enough to carry 
out the described experiment. 

2. Performance Results 

2.1 Server Times on Various Machines 

The results shown in table 1 are the total user time con- 
sumed by the server. This time includes XDR overhead 
and the time for the multiplication of two square matrices 
using SGEMM. In some cases, the machines were idle. In 
others (Convex, Cray) the machines were fully loaded and 
100% busy. These results are not intended to be an accurate 
benchmark of these machines under controlled conditions. 
The purpose is to characterize the conditions under which 
the computation can be effectively distributed. Several ver- 
sions of SGEMM were tried in order to get the best possible 
times. The rows noted “orig” in table 1 are the original 
SGEMM in BLAS3. The rows denoted “mod” are for a 
version modified by the authors to be cache-contained in 
small, nonvector, RISC workstations. The “lib” version is 
the one supplied by the system vendor, if available. For the 
Cray, this is the Cray-supplied, multitasked, parallel ver- 
sion. All client matrix elements are 32 bits. All server 
matrix multiplies are 32 bits, except the Cray Y-MP matrix
multiply, which is 64 bits, although the RPC is done in 
32 bits. 

2.2 Effectiveness Comparisons 

The previous results tell only part of the story. In order to 
compare the effectiveness of distributed computation 
versus local computation, and the effectiveness of distrib- 
uted computation on different systems, several cases will 
be scrutinized more closely. A sample N = 800 case was 
compared between two machines in order to examine the 
overhead. 

Based on an analysis of current production and research 
codes being run at ARC, and considering the relative costs 
and capabilities of RISC workstations and the Cray, a goal 
of achieving 50 MFLOPS on the Cray was established for 
this experiment. 

For the N = 800 case, the following data were gathered for 
the situation where the local workstation was a Sun 4/75 
and the remote server was on a Cray Y-MP8/864: 

Elapsed time on server: 64.4 seconds

Total CPU time on server (including system time): 26.4 seconds

User CPU time on server: 20.9 seconds
Table 1. User CPU seconds for an N x N matrix multiply, including XDR overhead

Machine                 N = 100      200      400      600      800     1000
Sun 4/490 - orig            0.9      7.0     55.4    185.4    436.6    846.9
Sun 4/490 - mod             0.4      3.2     25.6     80.5    198.1    388.0
Sun 4/75 - orig             0.7      5.5     42.4    140.5    332.3    645.7
Sun 4/75 - mod              0.4      2.9     21.4     70.2    165.8    321.2
SGI 4D/30 - orig            0.6      4.4     34.3    112.4    265.7    514.2
SGI 4D/30 - mod             0.3      2.8     22.6     73.4    175.8    334.4
DEC DS/5500 - orig          0.4      4.4     33.1    110.9    261.5    512.5
DEC DS/5500 - mod           0.4      2.8     22.9     76.7    186.6    366.8
IBM 320 - orig              0.6      3.4     21.1     63.8    145.2    276.8
IBM 320 - mod               0.5      2.7     17.7     52.4    121.9    229.0
IBM 320 - lib               0.4      1.7      8.8     23.5     48.9     86.9
Convex C210 - mod           0.8      3.9     19.1     51.3    106.6      N/A
Convex C210 - orig          0.8      3.7     18.0     47.7     98.6      N/A
Convex C210 - lib           0.8      3.2     14.4     35.0     68.6
Cray Y-MP8/864 - lib        0.3      1.2      5.1     12.0     20.9   [illegible]


User CPU time on server to do only the multiply: 3.45 seconds

Rate on server while doing only the multiply: 296.8 MFLOPS

Rate on server (user time only): 49.0 MFLOPS

User-visible rate of server over network: 15.9 MFLOPS

User CPU time to do the multiply on local machine: 157.7 seconds

Rate on local machine while doing the multiply: 6.5 MFLOPS

In this case, accessing the remote server achieved an
elapsed time speedup of 2.4 over doing the calculation 
locally on the Sun 4/75. This is less than the 4X speedup, 
which is generally considered necessary to interest a user in 
a significantly different computing method. For smaller 
arrays, the benefits are less, or even negative. However, on 
the plus side, the rate on the Cray is close to the established 
goal of 50 MFLOPS. 
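The rates and the speedup quoted for the N = 800 case follow from simple arithmetic on the measured times. The sketch below assumes that the rates count 2N^3 floating-point operations per multiply; this is an assumption, chosen because it reproduces the quoted figures. The function names matmul_mflops and speedup are hypothetical.

```c
#include <assert.h>

/* Rate in MFLOPS for an N x N matrix multiply that took `seconds`,
 * counting 2*N^3 floating-point operations (an assumption: this count
 * reproduces the rates quoted in the text). */
double matmul_mflops(int n, double seconds)
{
    double ops;

    assert(n > 0 && seconds > 0.0);
    ops = 2.0 * (double)n * (double)n * (double)n;
    return ops / seconds / 1.0e6;
}

/* Speedup of the remote run over the local run, both times in seconds. */
double speedup(double local_seconds, double remote_seconds)
{
    assert(local_seconds > 0.0 && remote_seconds > 0.0);
    return local_seconds / remote_seconds;
}
```

With the measured times, matmul_mflops(800, 3.45) gives about 296.8, matmul_mflops(800, 20.9) about 49.0, matmul_mflops(800, 64.4) about 15.9, and matmul_mflops(800, 157.7) about 6.5, while speedup(157.7, 64.4) is about 2.4.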

On further analysis, the Cray is unusual when compared
with other machines: the XDR overhead is relatively much
greater. For example, the XDR conversion of the input
matrices costs 5.1 seconds on the Sun 4/75, while on the
Cray, the time is 10.5 seconds. The Cray therefore takes
about twice as long in absolute terms. When the approximate
4X scalar speedup of the Cray is considered, it appears that
the Cray is about eight times slower doing data “conversion.”
No doubt this is due to the Cray’s non-IEEE floating-point
format. The Convex supports IEEE in hardware, but
had the same problem as the Cray. This is because the
Convex was lacking IEEE RPC libraries.

Therefore, while the method is effective, the RPC/XDR 
overhead on the Cray and Convex is far greater than it 
should be, and the realized benefit is significantly less than 
it should be. It is unknown at this time whether significant 
improvements are possible in the efficiency of the Cray 
XDR routines, but clearly the most desirable approach 
from the standpoint of distributed computation is for Cray 
to support IEEE floating-point arithmetic in hardware and 
software. 

2.3 Which Operations Are Potential Candidates?

Dense arrays have order $O(N^2)$ elements, where N is the
dimension of the array. Calculations that are suitable
candidates have more than $O(N^2)$ operations. For example,
matrix multiply using SGEMM, used in this paper, has $O(N^3)$
operations. We will examine this case using the number of
CPU clock ticks per operation. Assume that the number of
clock ticks per floating-point operation is roughly constant.
We will denote clock ticks per operation as $K_1$. Also
assume that the system overhead of transmitting and
converting an element is roughly constant. This constant, in
clock ticks per element, is denoted $K_2$ below. Note that
SGEMM uses three input matrices and one output matrix,
requiring the transmission via RPC of four matrices.
SGEMM actually requires $2N^3 + 3N^2$ arithmetic operations.
Ignoring the $N^2$ portion of the work, the ratio of work
performed to overhead is approximately:

$$\frac{2 K_1 N^3}{4 K_2 N^2} = \frac{N}{2}\,\frac{K_1}{K_2}$$

When this ratio is large enough, the operation will be
feasible. The measured speed for SGEMM on the Cray when
N = 800 is about 297 MFLOPS. To achieve the goal of
50 MFLOPS for the remote server, no more than about
5/6 the total work can be overhead. So, the above ratio must
be at least about 1/6. By measurement on the Cray, $K_1$ is
about 0.5618 clock ticks per operation, and $K_2$ is about
1134 clock ticks per element, considering user time only.
Considering both user time and system time, $K_2$ is about
1497. Hence, N must be approximately 897 in order to
achieve the 50-MFLOPS goal. This is a large matrix for the
workstation to handle locally, and would take several
hundred seconds on most current workstations.

If the Cray RPC/XDR conversion were about eight times
faster, as would be expected if the Cray were to use IEEE
floating-point arithmetic, then $K_2$ would drop to about 708
(including system overhead), and the feasible size would
drop to N = 424. This assumes that the “system” time is
due to network activity that is independent of floating-point
format. However, if the non-XDR user and system overhead
(assumed to be due to network overhead) could be
reduced to be more in line with the network overhead of a
typical workstation, then considering the relative scalar
speeds of the Cray and workstation, the Cray network
overhead would be about 1/3 of its current value. Then $K_2$
would drop to 311, and the feasible matrix size would drop
to about 186. This size matrix could easily be handled
locally on a workstation. A production client/server
implementation would normally be written to do the
multiplication of small matrices locally.
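The three feasible sizes quoted above can be reproduced by solving the work-to-overhead ratio (N/2)(K1/K2) for N. The sketch below assumes that the required ratio is taken as 50/297, the target rate divided by the measured multiply rate; this is an inference that happens to reproduce the quoted values of N, not a statement of how the figures were originally derived. The function name break_even_n is hypothetical.

```c
#include <assert.h>

/* Matrix dimension N at which (N/2)*(K1/K2) equals the required
 * work-to-overhead ratio.
 *   k1:    clock ticks per floating-point operation
 *   k2:    clock ticks of overhead per transmitted element
 *   ratio: required work/overhead ratio (assumed here to be 50/297)
 * Hypothetical helper for illustration only. */
double break_even_n(double k1, double k2, double ratio)
{
    assert(k1 > 0.0 && k2 > 0.0 && ratio > 0.0);
    return 2.0 * ratio * k2 / k1;
}
```

With k1 = 0.5618 and ratio = 50.0 / 297.0, the function returns about 897 for k2 = 1497, about 424 for k2 = 708, and about 186 for k2 = 311.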

Table 2 and figure 1 show the observed MFLOPS for vari- 
ous cases with the remote server running on a Sun 4/75 and 
on a Cray Y-MP8/864. The version of SGEMM optimized 
for workstations by the authors was used on the Sun, and 
the vendor-supplied library version was used on the Cray. 

The interesting thing about these timing results is that RPC 
overhead is small in the case of the Sun; there is little 
change in the overall efficiency over the range of array 
sizes. Conversely, on the Cray, there is a great deal of RPC 
overhead, and consequently there is a large increase in 
performance as array sizes increase. It is inefficient to 
remotely process small arrays on the Cray because of the 
large RPC overhead. On the other hand, there is a large per- 
formance advantage to be gained by using the Cray on 
large, dense arrays. Figure 1 illustrates this. 

Returning to the question of which operations are
economic, we are looking for functions, such as matrix
multiply, in which the Operation/Transmission ratio is greater
than $O(1)$. For example, matrix multiplication and standard
LU decomposition for dense matrices require $O(N^3)$
operations and require transmission of $O(N^2)$ elements. In
contrast, matrix vector operations require $O(N^2)$ operations
and transmission of $O(N^2)$ elements. Hence, matrix vector
operations are not normally suitable for this type of
distributed application. Other candidates include Fast Fourier
Transforms (FFTs), since an FFT requires transmission of
$O(N)$ elements, while requiring $O(N \log_2 N)$ operations.
However, for a single FFT, it is unlikely that the overhead
would be low enough because $\log_2 N$ grows very slowly
with N. Computational chemistry is an area with excellent
prospects for distributed applications since some
algorithms are $O(N^4)$, or even $O(N^5)$ or greater (ref. 4).

Table 2. Comparison of Cray Y-MP and Sun workstation user MFLOPS

Machine                 N = 100      200      400      600      800     1000
Sun 4/75 - mod              5.0     5.52     5.98     6.15     6.18     6.23
Cray Y-MP8/864 - lib        6.7    13.33     25.1    36.00    46.54    56.02

Figure 1. MFLOPS vs N for Cray and Sun servers.

3. Conclusions

The experiment demonstrated that using a distributed
client-server approach between a workstation and a
supercomputer can work today on some array-oriented
operations, such as matrix multiply, when applied to large,
dense problems. The inefficiencies of translating between
IEEE and Cray native format restrict the method to large
arrays of order N = 900 or greater. If the supercomputer in
question were to use IEEE floating-point format and had
networking efficiency comparable to the best performance
available today in workstations, then the client-server
approach could be practical for much smaller problems,
and also a wider variety of problems. It is likely, however,
that for the foreseeable future, only functions in which the
ratio of the number of operations to the number of elements
is at least $O(N)$ will be economic.

Finally, the results suggest that it might make sense in 
some cases to dedicate a special-purpose computing node 
in a network to function as an array-processor server for 
client machines in the same network. This could be the 
ideal use for a massively parallel machine that might other- 
wise be difficult to use efficiently in a general way with cur- 
rent compilers. With the distributed-application approach 
used here, it would not be necessary for the user to program
the remote machine or convert any code. Instead, a library
could be developed such that the user could link the library 
to an existing program, and then the remote computation 
would take place transparently to the user. 
APPENDIX


proto.x


/************************************************************************
 *                                                                      *
 * proto.x - remote matrix procedure protocol description               *
 *                                                                      *
 ************************************************************************/

struct sgemm_args {
    char  transa;
    char  transb;
    int   m;
    int   n;
    int   k;
    float alpha;
    float a<>;          /* input matrix a */
    int   lda;
    float b<>;          /* input matrix b */
    int   ldb;
    float beta;
    float c<>;          /* input matrix c */
    int   ldc;
};

struct singlep_matrix {
    float c<>;          /* output matrix c */
};

struct dgemm_args {
    char   transa;
    char   transb;
    int    m;
    int    n;
    int    k;
    double alpha;
    double a<>;         /* input matrix a */
    int    lda;
    double b<>;         /* input matrix b */
    int    ldb;
    double beta;
    double c<>;         /* input matrix c */
    int    ldc;
};

struct doublep_matrix {
    double c<>;         /* output matrix c */
};

/* program definition */

program REMOTE_MATRIX_PROG {
    version REMOTE_MATRIX_VERS {
        int R_SMATRIXALLOC( int ) = 1;
        int R_SMATRIXFREE( int )  = 2;
        int R_DMATRIXALLOC( int ) = 3;
        int R_DMATRIXFREE( int )  = 4;
        singlep_matrix R_SGEMM( sgemm_args ) = 5;
        doublep_matrix R_DGEMM( dgemm_args ) = 6;
    } = 1;              /* version number */
} = 536870912;          /* program number 0x20000000 */



server.c


/************************************************************************
 *                                                                      *
 * server.c - remote procedure that calls Level 3 BLAS                  *
 *                                                                      *
 ************************************************************************/

#include <stdlib.h>             /* malloc, free */
#include <rpc/rpc.h>
#include "proto.h"              /* generated by rpcgen */

#ifndef _CRAY                   /* fix-up function names */
# define SGEMM sgemm_
# define DGEMM dgemm_
#endif
#ifdef _CRAY
# define DGEMM SGEMM
#endif

static float  *smatrix;         /* ptr to a single precision matrix */
static int     ssize;           /* number of elements in smatrix */
static double *dmatrix;         /* ptr to a double precision matrix */
static int     dsize;           /* number of elements in dmatrix */

/******** Allocate memory for a single precision result matrix ***************/
int *
r_smatrixalloc_1( dimension )
int *dimension;
{
    static int status;          /* Must be static! */

    ssize = *dimension;
    smatrix = (float *)malloc( (unsigned)(ssize * sizeof(float)) );

    if( smatrix == 0 )
        status = 0;             /* not O.K. */
    else
        status = 1;             /* O.K. */

    return( &status );
}

/******** Allocate memory for a double precision result matrix ***************/
int *
r_dmatrixalloc_1( dimension )
int *dimension;
{
    static int status;          /* Must be static! */

    dsize = *dimension;
    dmatrix = (double *)malloc( (unsigned)(dsize * sizeof(double)) );

    if( dmatrix == 0 )
        status = 0;             /* not O.K. */
    else
        status = 1;             /* O.K. */

    return( &status );
}

/******** Free allocated memory for a single precision result matrix *********/
int *
r_smatrixfree_1( argp )
int *argp;
{
    static int status;          /* Must be static! */

    free( (char *)smatrix );
    status = 1;                 /* O.K. */

    return( &status );
}

/******** Free allocated memory for a double precision result matrix *********/
int *
r_dmatrixfree_1( argp )
int *argp;
{
    static int status;          /* Must be static! */

    free( (char *)dmatrix );
    status = 1;                 /* O.K. */

    return( &status );
}

/****************************** Call SGEMM ***********************************/
struct singlep_matrix *
r_sgemm_1( args )
struct sgemm_args *args;
{
    static struct singlep_matrix out;       /* Must be static! */
    int i;

    /* copy the input matrix c into the preallocated result matrix */
    for( i = 0; i < ssize; i++ ) smatrix[i] = args->c.c_val[i];

    SGEMM( &args->transa, &args->transb, &args->m, &args->n, &args->k,
           &args->alpha, args->a.a_val, &args->lda,
                         args->b.b_val, &args->ldb,
           &args->beta,  smatrix,       &args->ldc );

    out.c.c_len = ssize;
    out.c.c_val = smatrix;

    return( &out );
}

/****************************** Call DGEMM ***********************************/
struct doublep_matrix *
r_dgemm_1( args )
struct dgemm_args *args;
{
    static struct doublep_matrix out;       /* Must be static! */
    int i;

    /* copy the input matrix c into the preallocated result matrix */
    for( i = 0; i < dsize; i++ ) dmatrix[i] = args->c.c_val[i];

    DGEMM( &args->transa, &args->transb, &args->m, &args->n, &args->k,
           &args->alpha, args->a.a_val, &args->lda,
                         args->b.b_val, &args->ldb,
           &args->beta,  dmatrix,       &args->ldc );

    out.c.c_len = dsize;
    out.c.c_val = dmatrix;

    return( &out );
}
sgemm.c


/************************************************************************
 *
 * sgemm.c
 *
 * sgemm_ - procedure to do remote matrix multiplication using SGEMM
 *          from Level 3 BLAS.
 *
 *      C := alpha*op( A )*op( B ) + beta*C,
 *
 * where op( X ) is one of
 *
 *      op( X ) = X   or   op( X ) = X',
 *
 * and
 *
 * alpha and beta are scalars, and A, B, and C are matrices, with op( A )
 * an m by k matrix, op( B ) a k by n matrix, and C an m by n matrix.
 *
 ************************************************************************/

#include <stdio.h>
#include <string.h>             /* strncmp */
#include <rpc/rpc.h>
#include "proto.h"

#ifndef _CRAY
# define XERBLA xerbla_
#endif
#ifdef _CRAY
# define sgemm_ SGEMM
#endif
#ifdef ultrix
# include <time.h>
#endif
#ifdef sgi
# include <sys/time.h>
#endif

#define MAX(a,b) ( ( (a)>(b) ) ? (a) : (b) )
#define MIN(a,b) ( ( (a)<(b) ) ? (a) : (b) )

void sgemm_( transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc )

char  *transa;   /* addr of form option for matrix op(A)                 */
char  *transb;   /* addr of form option for matrix op(B)                 */
int   *m;        /* addr of number of rows of op(A) and C                */
int   *n;        /* addr of number of columns in op(B) and C             */
int   *k;        /* addr of number of columns of op(A) and rows of op(B) */
float *alpha;    /* addr of scalar alpha                                 */
float *a;        /* addr of array A                                      */
                 /*     A(lda, k) if transa == 'N' or 'n'                */
                 /*     A(lda, m) if transa != 'N' or 'n'                */
int   *lda;      /* addr of first dimension of A                         */
float *b;        /* addr of array B                                      */
                 /*     B(ldb, n) if transb == 'N' or 'n'                */
                 /*     B(ldb, k) if transb != 'N' or 'n'                */
int   *ldb;      /* addr of first dimension of B                         */
float *beta;     /* addr of scalar beta                                  */
float *c;        /* addr of array C                                      */
                 /*     C(ldc, n)                                        */
int   *ldc;      /* addr of first dimension of C                         */

{


    static CLIENT *cl   = NULL;            /* client handle                            */
    static char   *mptr = NULL;            /* pointer to env. variable "MATRIX_SERVER" */
    static char    machine[256];           /* name of server machine                   */
    static struct timeval tv = { 0, 0 };   /* structure to define timeout value        */
    static char    proc_name[7] = "SGEMM ";


    int    i;
    int    nota, notb, nrowa, nrowb;
    int   *status;
    int    dummy;
    int    csize;
    struct sgemm_args      args;
    struct singlep_matrix *matrix_out;

    char   *getenv();
    CLIENT *clnt_create();
    void    timeproc();


/******************* Check input parameters *********************************/
    if( (strncmp(transa,"N",1)==0) || (strncmp(transa,"n",1)==0) )
        nota = TRUE;
    else
        nota = FALSE;

    if( (strncmp(transb,"N",1)==0) || (strncmp(transb,"n",1)==0) )
        notb = TRUE;
    else
        notb = FALSE;

    if( nota==TRUE ) nrowa = *m;
    else             nrowa = *k;
    if( notb==TRUE ) nrowb = *k;
    else             nrowb = *n;


    i = 0;

    if( !nota
        && (strncmp(transa,"T",1)!=0) && (strncmp(transa,"t",1)!=0)
        && (strncmp(transa,"C",1)!=0) && (strncmp(transa,"c",1)!=0) )
        i = 1;
    else if( !notb
        && (strncmp(transb,"T",1)!=0) && (strncmp(transb,"t",1)!=0)
        && (strncmp(transb,"C",1)!=0) && (strncmp(transb,"c",1)!=0) )
        i = 2;
    else if( *m < 0 )
        i = 3;
    else if( *n < 0 )
        i = 4;
    else if( *k < 0 )
        i = 5;
    else if( *lda < MAX(1,nrowa) )
        i = 8;
    else if( *ldb < MAX(1,nrowb) )
        i = 10;
    else if( *ldc < MAX(1,*m) )
        i = 13;

    /* call the Level 2 BLAS error routine */
    if( i != 0 ) {
        XERBLA( proc_name, &i );
        return;
    }


/********************* Quick return if possible. ***************************/ 

    if( ( *m==0 ) || ( *n==0 ) ||
        ( ( ( *alpha==0.0 ) || ( *k==0 ) ) && ( *beta==1.0 ) ) ) return;

/****************** Get the remote server machine ***************************/
    mptr = getenv( "MATRIX_SERVER" );
    if( mptr != NULL ) {
        strcpy( machine, mptr );
    }
    else {
        fprintf( stderr, "Error: environment variable MATRIX_SERVER is not set to\
 the name of the server machine.\n" );
        exit( 1 );
    }


/***** Create client handle for calling REMOTE_MATRIX_PROG using "tcp" *****/
    if( cl == NULL ) {
        cl = clnt_create( machine, REMOTE_MATRIX_PROG, REMOTE_MATRIX_VERS,
                          "tcp" );
        if( cl == NULL ) {          /* Couldn't establish server connection */
            clnt_pcreateerror( machine );
            exit( 1 );
        }
    }


/************** Allocate space in server for result matrix: *****************/ 


    tv.tv_sec = 25;                              /* timeout in seconds        */
    clnt_control( cl, CLSET_TIMEOUT, &tv );      /* set client timeout value  */

    csize = *ldc * *n;
    status = r_smatrixalloc_1( &csize, cl );     /* allocate space in server  */

    if( status == NULL ) {
        clnt_perror( cl, "RPC error from calling r_smatrixalloc_1" );
        exit( 1 );
    }

    if( *status != 0 ) {
        fprintf( stderr,
                 "Error: Could not allocate memory for result matrix on %s.\n",
                 machine );
        exit( 1 );
    }

/************** Call remote procedure to multiply matrices: *****************/
    tv.tv_sec = 2 * (25. + 1.E-6 * ( 2. * *m * *k * *n ) );   /* timeout in secs */
    clnt_control( cl, CLSET_TIMEOUT, &tv );       /* set client timeout value */


    args.transa = *transa;
    args.transb = *transb;
    args.m      = *m;
    args.n      = *n;
    args.k      = *k;
    args.alpha  = *alpha;

    if( nota == TRUE )
        args.a.a_len = *lda * *k;                 /* op( A ) = A  */
    else
        args.a.a_len = *lda * *m;                 /* op( A ) = A' */
    args.a.a_val = a;
    args.lda     = *lda;

    if( notb == TRUE )
        args.b.b_len = *ldb * *n;                 /* op( B ) = B  */
    else
        args.b.b_len = *ldb * *k;                 /* op( B ) = B' */
    args.b.b_val = b;
    args.ldb     = *ldb;

    args.beta    = *beta;
    args.c.c_len = csize;
    args.c.c_val = c;
    args.ldc     = *ldc;

    matrix_out = r_sgemm_1( &args, cl );          /* multiply matrices */

    if( matrix_out == NULL ) {
        clnt_perror( cl, "RPC error from calling r_sgemm_1" );
        exit( 1 );
    }

    for( i = 0; i < matrix_out->c.c_len; i++ ) {
        c[i] = matrix_out->c.c_val[i];
    }

/********************* Free allocated server memory: ************************/
    tv.tv_sec = 25;                               /* timeout in seconds       */
    clnt_control( cl, CLSET_TIMEOUT, &tv );       /* set client timeout value */

    status = r_smatrixfree_1( &dummy, cl );       /* free allocated space     */

    if( status == NULL ) {
        clnt_perror( cl, "RPC error from calling r_smatrixfree_1" );
        exit( 1 );
    }

    return;

}
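For checking results returned by the remote call, it can be useful to evaluate the same operation locally on a small case. The following is a minimal, unoptimized sketch of C := alpha*op(A)*op(B) + beta*C for column-major (Fortran-ordered) arrays, which is the storage layout the interface above assumes. The name `ref_sgemm` and its by-value argument list are illustrative assumptions, not part of the BLAS or of the code in this paper. Any transa or transb value other than 'N' or 'n' is treated as a transpose, which matches 'T' and 'C' for real data; no argument checking is done.

```c
#include <ctype.h>
#include <stddef.h>

/* Reference (unoptimized) evaluation of C := alpha*op(A)*op(B) + beta*C
   for column-major arrays, intended only as a local cross-check.
   ref_sgemm is a hypothetical helper, not a BLAS routine. */
void ref_sgemm(char transa, char transb, int m, int n, int k,
               float alpha, const float *a, int lda,
               const float *b, int ldb,
               float beta, float *c, int ldc)
{
    int nota = (toupper((unsigned char) transa) == 'N');
    int notb = (toupper((unsigned char) transb) == 'N');
    int i, j, l;

    for (j = 0; j < n; j++) {
        for (i = 0; i < m; i++) {
            float sum = 0.0f;
            size_t cij = (size_t) i + (size_t) j * ldc;

            for (l = 0; l < k; l++) {
                /* element (i,l) of op(A) and element (l,j) of op(B) */
                float ail = nota ? a[i + (size_t) l * lda]
                                 : a[l + (size_t) i * lda];
                float blj = notb ? b[l + (size_t) j * ldb]
                                 : b[j + (size_t) l * ldb];
                sum += ail * blj;
            }

            /* BLAS convention: C is not read when beta is zero */
            c[cij] = alpha * sum + (beta == 0.0f ? 0.0f : beta * c[cij]);
        }
    }
}
```

For a 2 by 2 case with A = [1 2; 3 4] and B = [5 6; 7 8] stored by columns, calling `ref_sgemm('N', 'N', 2, 2, 2, 1.0f, a, 2, b, 2, 0.0f, c, 2)` leaves `c` holding 19, 43, 22, 50, that is, C = [19 22; 43 50].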